Regulatory Repositories

Coline Zeballos, Roche

Yann Féat, mainanalytics

A Universal Conundrum

There are n packages for x. Which one is the best?

A Universal Conundrum

By choosing packages, we’re choosing our:

  • Feature set
  • Dependency footprint
  • Integration with other packages
  • Preferred lifecycle management of our tools
  • Community that we can lean on for help

A Universal Conundrum

Regulated Industries: Justification as a Requirement

Goals

  • Provide a community-maintained catalog of package quality indicators (“risk metrics”)
  • Serve quality indicators in a standard format
  • Thoroughly document the system used to perform quality assessment
  • Demonstrate how regulatory-ready risk assessments can be provided using public quality indicators
  • Serve subsets of packages that conform to a specified risk tolerance
  • Improve transparency of industry R package adoption, endorsement and regulator interaction

An evolving R ecosystem

  • (NOTE: show interaction between CRAN, RVH Reg R Repo (us), RC Submissions WG, RC Repositories WG, pharmaverse, other?)

Pilot Implementation

Focus on proving capabilities and on quick development.

Package: praise
Version: 1.2.3
DownloadURL: 
  github.com/cran/praise.tar.gz
code_coverage: 0.75

Package: survfit
Version: 2.3.4
DownloadURL:
  github.com/cran/survfit.tar.gz
code_coverage: 0.87

  • r-hub/repos: package repository, built on a CRAN mirror and GitHub Actions
  • {riskscore}: pre-calculated {riskmetric} scores
  • The two sources are joined manually into the PACKAGES file
library(pharmapkgs)
options(available_packages_filters =
  risk_filter(code_coverage > 0.8))
available.packages()
pak::pkg_install("survfit")

Interacting with the repo

Packages risk filters

  • Helper package for system administrators
  • Restricts packages available for installation to those fitting a policy
  • Uses packages metadata in the repo
  • May be used together with manual checks (e.g., read a statistical review)

flowchart TD
  A[All packages] --> B{Code\n coverage\n > 95%?}
  B -- Yes --> C{Has\n doc.?}
  C -- Yes --> D(Available for safety-critical activities)
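
The policy in the flowchart can be sketched as a plain filter function of the kind base R accepts in `available_packages_filters`. This is a minimal sketch, not the helper package's implementation: `safety_critical_filter` is a hypothetical name, the toy metadata reuses the `covr_coverage` and `has_vignettes` fields with placeholder values, and "has doc." is assumed to mean `has_vignettes` equal to 1.

```r
# Hypothetical filter mirroring the flowchart: keep a package only if
# coverage is above 95% AND it ships documentation (vignettes).
safety_critical_filter <- function(db) {
  # PACKAGES metadata is read as a character matrix, so convert first
  coverage <- suppressWarnings(as.numeric(db[, "covr_coverage"]))
  has_doc <- db[, "has_vignettes"] %in% "1"
  keep <- !is.na(coverage) & coverage > 0.95 & has_doc
  db[keep, , drop = FALSE]
}

# Toy metadata (placeholder values), shaped like available.packages() output
db <- cbind(
  Package = c("pkg_1", "pkg_4", "pkg_5"),
  covr_coverage = c("0.967", "0.864", "0.924"),
  has_vignettes = c("1", "0", "1")
)

allowed <- safety_critical_filter(db)
print(allowed[, "Package"])
```

In a real session such a function could be registered with `options(available_packages_filters = list(add = TRUE, safety_critical_filter))`, provided the extra columns are requested through the `fields` argument of `available.packages()`.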

Usage

Unfiltered

available.packages()
    Package
1   colorspace
2   farver
3   isoband
...
106 tripack

Filtered

fltr <- risk_filter(covr_coverage > 0.95
  & has_vignettes)
options(available_packages_filters = fltr)
available.packages()
   Package
1  colorspace
2  magrittr
3  R6
...
32 shinyjs

Repository ‘back-end’

Infrastructure setup

  • Hosts risk assessment metadata
  • Integrates with pak::pkg_install
  • Supports multiple levels of risk tolerance
  • Should allow for reproducible package installation to support historical analysis (snapshot, etc.)

DCF file forked from r-hub/repos

Package: bslib
Version: 0.6.1
Depends: R (>= 2.10), R (>= 4.4), R (< 4.4.99)
License: MIT + file LICENSE
Built: R 4.4.0; ; 2023-11-29 16:39:06 UTC; unix
RVersion: 4.4
Platform: x86_64-pc-linux-gnu-ubuntu-22.04
Imports: base64enc, cachem, grDevices, htmltools (>= 0.5.7), jquerylib (>= 0.1.3),
         jsonlite, lifecycle, memoise (>= 2.0.1), mime, rlang, sass (>= 0.4.0)
DownloadURL: https://github.com/cran/bslib/releases/download/0.7.0/bslib_0.7.0_b2_R4.5_x86_64-pc-linux-gnu-ubuntu-22.04.tar.gz
...

Added fields for risk-based assessment

riskmetric_run_date: 2023-06-21
riskmetric_version: 0.2.1
pkg_score: 0.29
covr_coverage: 0.85
has_vignettes: 1
remote_checks: 0.84
...
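
Because the added fields live in the same DCF format as the standard ones, base R can read them back without extra tooling. The sketch below assumes the risk fields are appended to the same record as the package entry (the slides show them separately) and uses an abbreviated record built from the values above.

```r
# Abbreviated DCF record combining standard and risk fields
# (assumption: both sets of fields sit in one record per package)
dcf_text <- c(
  "Package: bslib",
  "Version: 0.6.1",
  "riskmetric_version: 0.2.1",
  "pkg_score: 0.29",
  "covr_coverage: 0.85",
  "has_vignettes: 1"
)

con <- textConnection(dcf_text)
pkgs <- as.data.frame(read.dcf(con), stringsAsFactors = FALSE)
close(con)

# read.dcf() returns character values; convert the metric columns
metric_cols <- c("pkg_score", "covr_coverage", "has_vignettes")
pkgs[metric_cols] <- lapply(pkgs[metric_cols], as.numeric)

str(pkgs)
```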

Packages cohort validation workflow

Risk assessment pipeline

  • Calculates package QA metadata for updated packages and their reverse dependencies
  • Produces logs and other reproducibility data
  • Should also be able to run on private infrastructure
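
To make "updated packages and their reverse dependencies" concrete, here is a minimal sketch that computes the set of packages to re-assess after an update. The dependency map and the function name `packages_to_reassess` are assumptions made for illustration; a real pipeline would derive dependencies from the repository metadata.

```r
# Assumed direct dependencies for a toy cohort:
# names are packages, values are the packages they depend on
deps <- list(
  pkg_1 = c("pkg_2", "pkg_3"),
  pkg_2 = "pkg_3",
  pkg_3 = character(0),
  pkg_4 = "pkg_5",
  pkg_5 = character(0)
)

# Hypothetical helper: the updated packages plus everything that
# depends on them, directly or transitively
packages_to_reassess <- function(updated, deps) {
  to_assess <- updated
  repeat {
    depends_on_set <- vapply(
      deps, function(d) any(d %in% to_assess), logical(1)
    )
    new <- setdiff(names(deps)[depends_on_set], to_assess)
    if (length(new) == 0) break
    to_assess <- c(to_assess, new)
  }
  sort(to_assess)
}

print(packages_to_reassess("pkg_2", deps))
```

With this map, an update to pkg_2 triggers a re-assessment of pkg_2 and pkg_1, while pkg_3 (a dependency of pkg_2, not a dependent) is left alone.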

Packages cohort validation workflow

Figure: dependency graph of a five-package cohort (edges: pkg_2 -> pkg_1, pkg_3 -> pkg_1, pkg_3 -> pkg_2, pkg_5 -> pkg_4), shown in three steps.

Step 1: initial state

  Package  Version  covr_coverage  has_vignettes  pkg_score
  pkg_1    1.15     0.967          1              0.359
  pkg_2    3.5      0.984          1              0.154
  pkg_3    1.9      0.992          1              0.312
  pkg_4    0.5      0.864          0              0.414
  pkg_5    4.2      0.924          1              0.234

Step 2: pkg_2 is updated from 3.5 to 3.6; metrics for pkg_2 and pkg_1 are pending recalculation

  Package  Version  covr_coverage  has_vignettes  pkg_score
  pkg_1    1.15     ...            ...            ...
  pkg_2    3.6      ...            ...            ...
  pkg_3    1.9      0.992          1              0.312
  pkg_4    0.5      0.864          0              0.414
  pkg_5    4.2      0.924          1              0.234

Step 3: metrics for pkg_1 and pkg_2 are recalculated

  Package  Version  covr_coverage  has_vignettes  pkg_score
  pkg_1    1.15     0.967          1              0.314
  pkg_2    3.6      0.987          1              0.148
  pkg_3    1.9      0.992          1              0.312
  pkg_4    0.5      0.864          0              0.414
  pkg_5    4.2      0.924          1              0.234

Implementation: get new releases from GitHub

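library(magrittr) # provides the %>% pipe
library(gh)       # provides gh() for calling the GitHub API

# gh_repos (character vector of "user/repo") and packages_old (the parsed
# PACKAGES file) are assumed to be defined earlier
df_releases <- dplyr::tibble()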
for (i in seq_along(gh_repos)) {
  user_repo_i <- gh_repos[i]
  url_old_i <- packages_old[i, "DownloadURL"]
  name_old_i <- packages_old[i, "File"]

  # fetch release assets from the GitHub API, one page of 100 releases at a time
  j <- 1
  ls_releases_i <- list()
  repeat {
    ls_releases_ij <- gh(sprintf("GET /repos/%s/releases", user_repo_i),
      per_page = 100, page = j
    )
    if (length(ls_releases_ij) == 0) {
      break
    }
    ls_releases_i <- c(ls_releases_i, ls_releases_ij)
    if (length(ls_releases_ij) < 100) {
      break
    }
    j <- j + 1
  }

  # add the list of release assets to a data frame
  df_releases <- ls_releases_i %>%
    lapply(function(ls_releases_ij) {
      lapply(ls_releases_ij[["assets"]], function(asset) {
        asset["uploader"] <- NULL
        asset
      }) %>%
        dplyr::bind_rows()
    }) %>%
    dplyr::bind_rows() %>%
    dplyr::mutate(
      user_repo = user_repo_i,
      url_old = url_old_i,
      name_old = name_old_i
    ) %>%
    dplyr::bind_rows(df_releases)
}
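
The pagination pattern used in the loop above can be isolated and tried without network access. In this sketch `fetch_all_pages` and `fake_fetch` are hypothetical names; `fake_fetch` stands in for the `gh()` call and serves 250 dummy items.

```r
# Hypothetical helper: keep requesting pages until a page comes back
# shorter than per_page (an empty page also ends the loop)
fetch_all_pages <- function(fetch_page, per_page = 100) {
  results <- list()
  page <- 1
  repeat {
    batch <- fetch_page(page = page, per_page = per_page)
    results <- c(results, batch)
    if (length(batch) < per_page) break
    page <- page + 1
  }
  results
}

# Stand-in for the API: 250 dummy items served page by page
fake_items <- as.list(seq_len(250))
fake_fetch <- function(page, per_page) {
  start <- (page - 1) * per_page + 1
  if (start > length(fake_items)) return(list())
  fake_items[start:min(start + per_page - 1, length(fake_items))]
}

all_items <- fetch_all_pages(fake_fetch)
print(length(all_items))
```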

Implementation: post-process the results

df_cmp_releases <- df_releases %>%
  dplyr::select(
    user_repo, name, created_at, browser_download_url, url_old,
    name_old
  ) %>%
  dplyr::rename(url = browser_download_url) %>%
  dplyr::filter(stringr::str_detect(
    name, stringr::coll("R4.4_x86_64-pc-linux-gnu-ubuntu-22.04")
  )) %>%
  dplyr::mutate(
    created_at = lubridate::as_datetime(created_at, format = "%Y-%Om-%dT%H:%M:%SZ"),
    libc = stringr::str_detect(url, "\\-libc")
  ) %>%
  dplyr::arrange(user_repo, libc, dplyr::desc(created_at)) %>%
  dplyr::group_by(user_repo) %>%
  dplyr::slice_head(n = 1) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(
    pkg = stringr::str_split_i(user_repo, "/", 2),
    ver = stringr::str_split_i(name, "_", 2),
    ver_old = stringr::str_extract(url_old, "download/(.+)/", group = 1)
  ) %>%
  dplyr::select(pkg, name, name_old, ver, ver_old, url)
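
The version extraction at the end of this pipeline relies on the naming convention of the release assets. As a rough illustration, the same parsing can be done in base R on the bslib URL shown earlier; the variable names here are chosen for the example only.

```r
# Example asset URL taken from the DCF entry shown earlier
url <- paste0(
  "https://github.com/cran/bslib/releases/download/0.7.0/",
  "bslib_0.7.0_b2_R4.5_x86_64-pc-linux-gnu-ubuntu-22.04.tar.gz"
)
user_repo <- "cran/bslib"

# package name: second piece of "user/repo"
pkg <- strsplit(user_repo, "/", fixed = TRUE)[[1]][2]

# version from the asset file name: second "_"-separated piece
asset_name <- basename(url)
ver_from_name <- strsplit(asset_name, "_", fixed = TRUE)[[1]][2]

# version from the URL: the path segment right after "download/"
ver_from_url <- sub(".*/download/([^/]+)/.*", "\\1", url)

print(c(pkg = pkg, name = ver_from_name, url = ver_from_url))
```

Comparing the version parsed from the newest asset with the one parsed from the previously stored URL is what flags a package as updated.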

Implementation: update metrics

df_pkg_diff <- df_cmp_releases %>%
  dplyr::filter(ver != ver_old)

if (nrow(df_pkg_diff) > 0) {
  # calculate risk metrics

  # temporarily point "repos" at a CRAN mirror; the original value is restored below
  opt_repos_init <- getOption("repos")
  options(repos = c("CRAN" = "https://cran.rstudio.com"))

  df_pkg_metrics <- df_pkg_diff$pkg %>%
    riskmetric::pkg_ref(source = "pkg_cran_remote") %>%
    riskmetric::pkg_assess() %>%
    riskmetric::pkg_score() %>%
    cbind(df_pkg_diff) %>%
    dplyr::select(-c(pkg, ver, pkg_ref))

  options(repos = opt_repos_init)

  # replace PACKAGES file
  
  # ...
}
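
The step that replaces the PACKAGES file is elided above. One possible shape for it, offered purely as an assumption and not as the authors' implementation, is to update the metric fields of the changed packages and write the result back with base R's `write.dcf()`. The package entries and the new coverage value below are placeholders.

```r
# Toy PACKAGES content (placeholder values)
packages <- data.frame(
  Package = c("praise", "survfit"),
  Version = c("1.2.3", "2.3.4"),
  covr_coverage = c("0.75", "0.87"),
  stringsAsFactors = FALSE
)

# Freshly calculated metric for one updated package (placeholder value)
new_metrics <- data.frame(
  Package = "praise",
  covr_coverage = "0.91",
  stringsAsFactors = FALSE
)

# Overwrite the metric for the matching rows only
idx <- match(new_metrics$Package, packages$Package)
packages$covr_coverage[idx] <- new_metrics$covr_coverage

# Write the updated index to a temporary file and read it back
out <- file.path(tempdir(), "PACKAGES")
write.dcf(packages, file = out)
round_trip <- read.dcf(out)
print(round_trip)
```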

Our roadmap

What’s next

Automating up-to-date quality metrics to support sponsor risk assessment

Package: praise
Version: 1.2.3
DownloadURL: 
  github.com/cran/praise.tar.gz
code_coverage: 0.75

Package: survfit
Version: 2.3.4
DownloadURL:
  github.com/cran/survfit.tar.gz
code_coverage: 0.87

  • r-hub/repos: package repository, built on a CRAN mirror and GitHub Actions
  • pharmaR/repos: periodically re-calculates metrics for updated packages and writes them to PACKAGES
library(pharmapkgs)
options(available_packages_filters =
  risk_filter(code_coverage > 0.8))
available.packages()
pak::pkg_install("survfit")
risk_report("praise")
(Diagram labels: Reference Image; PDF)

Reference container image(s)

  • Should mimic environments of companies and health authority reviewers
  • Integrates with most modern analytic workbench tools and an evaluation pipeline
  • To be used by the Regulatory R Repository for packages cohort validation

Closing

Impact

Community Grants & Sponsorships

Over USD 1.4 million

Organizing Large Scale Collaborative Projects

R Validation Hub, R-Ladies

Co-Host Multidisciplinary Data Science Forums

Stanford Data Institute

Direct Support for Key R Events

R/Medicine, R/Pharma, useR!, LatinR, and more

Direct Worldwide Support for R User Groups

Join us

r-consortium.org

  • Help guide the future direction of the R language
  • Collaborate on cross-industry initiatives
  • Raise your leadership profile in the R Community
  • Protect your investment in R while supporting the common good

Thank you

  • (NOTE: list of Core team members)